OC-IA-P6 - YELP - NLP PART

This notebook aims at finding insights in a set of restaurant reviews from Yelp.

Reviews loading

Data filepath

First look

The file is too big to be opened directly (the kernel crashes). So let's first have a look at the first line only:
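A minimal sketch of that peek: read a single line without loading the rest of the file. A tiny temporary file stands in for the multi-gigabyte original, and the real filename (e.g. `yelp_academic_dataset_review.json`) is an assumption.

```python
import json
import os
import tempfile

# Tiny stand-in for the multi-gigabyte review file (one JSON object per line);
# the real path, e.g. "yelp_academic_dataset_review.json", is an assumption.
path = os.path.join(tempfile.mkdtemp(), "reviews.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"review_id": "a1", "stars": 4.0, "text": "Great tacos!"}) + "\n")
    f.write(json.dumps({"review_id": "b2", "stars": 2.0, "text": "Too salty."}) + "\n")

# readline() fetches only the first line: nothing else is loaded into memory.
with open(path, encoding="utf-8") as f:
    first_review = json.loads(f.readline())

print(first_review["text"])
```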

What we need to collect is the 'text' section of each review. How many reviews do we have? Since we can't load the file whole, we count them indirectly:
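Counting can be done the same way, by streaming over the file line by line (a sketch on the same kind of stand-in file; the real file holds one JSON review per line):

```python
import json
import os
import tempfile

# Stand-in file with one JSON review per line (the real file has millions).
path = os.path.join(tempfile.mkdtemp(), "reviews.json")
with open(path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"text": f"review {i}"}) + "\n")

# Iterating over a file object yields one line at a time: constant memory.
with open(path, encoding="utf-8") as f:
    n_reviews = sum(1 for _ in f)

print(n_reviews)
```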

Load dataset

We'll work with a dask DataFrame to get a sample of ~0.1% of the reviews (about 8,600 reviews) from the full dataset.

Here are the first lines of this sample.

TFIDF creation

To process the reviews, our computer first needs to convert them into numbers, since numbers are all it can handle. But how can we achieve this? One way is to list all the words across all the reviews to build a vocabulary, then count how many times each vocabulary word occurs in a given review. This is called "Bag of words" ("bag", since the order of words is not taken into account). We can then try to automatically group reviews that use the same words and infer that they are therefore about the same topic.

The Bag of words (BOW) is efficient but has at least 2 flaws:

To address this, we will first compute the frequency of each term across all the reviews, then take its inverse, so that the more frequent a word is, the lower this inverse value. Then, for each review, the frequency of each term within that review (say, the word 'chicken') is multiplied by the inverse frequency of the word across the whole corpus. Since not all reviews talk about chicken, the final weight of the word 'chicken' in a review that does may be high enough to suggest it is an important topic for this review.

This double weighting is called TFIDF (Term Frequency - Inverse Document Frequency).

Corpus cleaning and preprocessing

Before applying this process, we first need to clean the data to group close words. For example, we want the words 'Fried', 'fry', 'fries', etc., to be grouped under a generic root 'fry', in lowercase. This process is called lemmatization.

We also need to exclude overly frequent words, such as "the", "in", etc. We could identify them manually, but we have a tool, spaCy, that already ships with a list of such words for English (the stop words) and can exclude them automatically.

And of course, we need to separate the words in the text, excluding punctuation: this is also something that spaCy can handle.

Tokenization in spaCy is powerful enough to handle punctuation even when spaces are missing (see example below), so our preprocessing function can remain simple.

(But notice the "I've" that should have been tokenized as "I 've", since "It's" was properly tokenized as "It 's".)

Let's process our sample:
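A minimal sketch of such a preprocessing function. It uses spaCy's blank English pipeline, whose rule-based tokenizer, stop-word flags, and punctuation flags work without a trained model; the notebook's full pipeline (e.g. `en_core_web_sm`, an assumption here) would also provide real lemmas, for which the sketch falls back to lowercasing.

```python
import spacy

# Blank English pipeline: tokenizer + lexical attributes only.
# A full model such as en_core_web_sm (assumed in the notebook) adds lemmas.
nlp = spacy.blank("en")

def preprocess(text: str) -> str:
    """Drop punctuation, spaces and stop words; lemmatize when available."""
    doc = nlp(text)
    kept = []
    for tok in doc:
        if tok.is_punct or tok.is_space or tok.is_stop:
            continue
        # tok.lemma_ is empty without a lemmatizer; fall back to lowercase.
        kept.append(tok.lemma_.lower() if tok.lemma_ else tok.lower_)
    return " ".join(kept)

# The period is split off even though the space after it is missing:
print(preprocess("Great fries.Really loved it."))
```

Note that "Really" and "it" disappear because they are on spaCy's English stop-word list, not because of the punctuation handling.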

Optimizing Tfidf Vectorizer

Now that our reviews are cleaned (no punctuation, no stop words, lemmatized forms), we can compute the TFIDF vectors, each of them representing a review in our dataset.

First attempt with default parameters

This is much too big. The most frequent English words were already dropped thanks to spaCy's stop words. But among the rarest words (including misspelled ones), how many would be dropped by tuning min_df (the minimum proportion, or count, of documents in which a word must appear to be kept in the vocabulary)?

Optimizing the min_df parameter

Let's find how many words remain in the vocabulary if we exclude those used (in our dataset of ~8,600 reviews) by fewer than 5 reviews, 10 reviews, 50 reviews, etc.
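A sketch of such a sweep on a toy corpus standing in for the ~8,600 preprocessed reviews; an integer min_df is an absolute document count:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 86 toy documents standing in for the preprocessed reviews.
docs = (
    ["fried chicken great"] * 50
    + ["pizza great"] * 30
    + ["sushi fresh"] * 5
    + ["quinoa bowl"] * 1
)

# Vocabulary size left after requiring each word in >= min_df documents.
for min_df in (1, 2, 10, 40):
    vec = TfidfVectorizer(min_df=min_df).fit(docs)
    print(min_df, len(vec.vocabulary_))
```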

Re-training with best min_df parameter

We pick a value of 0.001 for our min_df. This means that words appearing in fewer than 0.1% of reviews (≈ 8 reviews in our sample dataset) will be discarded.
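With a float, min_df is a proportion of documents rather than a count (a toy sketch; on the real sample, 0.001 × ~8,600 ≈ 8 reviews):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fried chicken great", "pizza great", "sushi fresh"]

# min_df as a float: keep words appearing in at least 0.1% of documents.
# On this tiny corpus the threshold is below one document, so the whole
# vocabulary survives; on ~8600 reviews it amounts to ~8 reviews.
vec = TfidfVectorizer(min_df=0.001)
X = vec.fit_transform(docs)
print(X.shape)
```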

We end up with a matrix representing the weighted frequency of each vocabulary word in each review.

Let's compare the first matrix row with the first sample review it was created from:

Can all words be found in the sample?

What proportion of the sample's words is left after processing?
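A sketch of this check on a toy corpus (data and names are illustrative only): after fitting with a min_df threshold, we look up which words of one review survived in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'quinoa' and 'bowl' are too rare to survive min_df pruning.
docs = ["fried chicken great"] * 10 + ["pizza great"] * 10 + ["great quinoa bowl"]
vec = TfidfVectorizer(min_df=2).fit(docs)

words = docs[-1].split()                       # the review under scrutiny
kept = [w for w in words if w in vec.vocabulary_]
ratio = len(kept) / len(words)
print(kept, ratio)
```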

Visualisation

2D PCA / TruncatedSVD

A first attempt to apply PCA to our X sparse matrix gave the following error message:

TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.

Excerpt from the scikit-learn documentation for sklearn.decomposition.TruncatedSVD:

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
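A minimal sketch of the swap from PCA to TruncatedSVD on a sparse tf-idf matrix (toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "fried chicken great",
    "crispy fried chicken",
    "pizza slice great",
    "cheap pizza slice",
    "fresh sushi roll",
    "sushi roll lunch",
]
X = TfidfVectorizer().fit_transform(docs)      # sparse: PCA would raise TypeError
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)                    # dense (n_docs, 2) coordinates
print(X_2d.shape, svd.explained_variance_ratio_.sum())
```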

This explained variance is low, but remember that our initial X matrix has many features!

This scatter's shape evokes the trajectories of particles expelled from a single point located at (0, 0).

Moreover, we notice that the review scores seem to be reflected in the data (worst scores at the top of the scatter plot), even though the training and the dimensionality reduction were performed exclusively on the texts, not the scores. The scores were only added as colors after the dimensionality reduction was computed.

And with 3 dimensions?

What are these components made of?

It seems that we could describe these 3 components as follows:

How about next components? Let's look at the 10 first:

t-SNE
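A hedged sketch of the t-SNE step: t-SNE is slow on high-dimensional input, so a common recipe (assumed here, with random data standing in for the tf-idf matrix already reduced by TruncatedSVD) is to bring the data down to ~50 dimensions first.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random (n_reviews, 50) array standing in for the SVD-reduced tf-idf matrix.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(60, 50))

# perplexity must stay below the number of samples; "pca" init stabilizes runs.
emb = TSNE(n_components=2, perplexity=10, init="pca",
           random_state=0).fit_transform(X_reduced)
print(emb.shape)
```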

UMAP

We'll first reduce the number of dimensions with TruncatedSVD.

One can notice that some blobs group reviews about hair stylists, car cleaners / retailers, or hotels, while other blobs relate to specific foods. Topic modeling may therefore allow us to identify these specific topics.

Topic modeling with LDA

We can extract topics from our reviews. The main parameter is the number of topics: the higher the number, the more detailed the topics we get.

For instance, with more than 15-20 topics, we notice topics such as kids (cf. the section with 14 topics, topic #7) or healthcare, which were not noticeable with a lower number of topics.

At first we might think that TruncatedSVD / LSA gives more interesting results than LDA, since in TruncatedSVD the words with a negative impact on variance help us better understand the topic, while the lowest-scored words in LDA components are simply non-significant.

But the word clouds show that some reviews are about subjects other than food, and LDA may help us find them more easily.

To select the best number of topics, we have to look at each set of word clouds and decide manually whether it is relevant: for instance, 10 topics seems an interesting choice since we clearly see a 'haircut' topic, etc. But how could we measure the relevancy of this choice more objectively?

A silhouette score cannot be used here, since the BOW concept knows nothing about semantic similarity. We humans can confirm that linking "hair" and "cut" is relevant because we know they are semantically related, while all our BOW model knows is that these words often co-occur in our corpus, which is not the same thing.

To find the best number of components, a strategy could be to use a word embedding that takes semantic similarity into account, such as Word2vec (used by our spaCy model) or BERT, then compute a silhouette score for each 'clustering' (the number of components can be seen as a number of centroids). But this is out of scope for this project.

Sources:
https://towardsdatascience.com/2-latent-methods-for-dimension-reduction-and-topic-modeling-20ff6d7d547